to determine the most important features in making a machine learning model.
Furthermore, Random Forest (RF) and Support Vector Machines (SVM)
were the machine learning models used, with highest accuracies of 90%
and 95%, respectively. Based on these results, SVM is a better model
than random forest in terms of accuracy.
1. INTRODUCTION

Cancer is one of the deadliest diseases in the world. According to the World Health Organization (WHO),
it was the second leading cause of death globally in 2018, responsible for approximately 9.6 million deaths.
There are over 100 different types of cancer that affect humans. This study analyzes
breast cancer, a disease in which cells grow out of control to form a tumor, which tends to spread to other
parts of the body. There are three common parts of the breast whose cells can turn cancerous, namely
the lobules, the ducts, and the connective tissue. Lobules are glands which produce milk, while the ducts are thin tubes
that carry milk away from the lobules. The connective tissue consists of fibrous and fatty tissues which hold
the breast together and give it shape and size. In most cases, the cancer begins in the lobules or ducts.

The exact causes of breast cancer are still not known, but experts are of the opinion that an interaction
of genes with lifestyle, environment, and hormones tends to provoke abnormal cell growth.
Several factors increase the risk of getting breast cancer. One of them is age: according to research, most
people are diagnosed after the age of 50. Men also have a risk of getting breast cancer, even though
it is much lower than that of women. Someone who had her first menstrual period before the age of 12, or who started
menopause after 55, stands a higher risk of being affected. Radiation therapy is another factor, as it can make
cells grow abnormally. Furthermore, a woman has a higher risk of getting breast cancer if a first-degree
relative (mother, daughter, or sister) was diagnosed with it, a factor which cannot be changed.
There are also other factors; for instance, overweight women after menopause stand a higher risk than those
with normal body weight. Care should also be taken by those with increased hormone levels after menopause,
as these raise the risk of getting breast cancer. Having all of the above-mentioned factors
does not mean that someone has the disease, and vice versa.

Symptoms of breast cancer differ from person to person. However, some common symptoms include skin
changes such as swelling and redness, visible differences in one or both breasts, the appearance of a lump which
does not go away after some time, a feeling of pain or a burning sensation around the breast area even with
no pressure, a change in the nipple, and itching. Anyone who notices any of these symptoms
should consult a doctor immediately.

The treatment for breast cancer depends on the type, the tumor size, and how far
it has spread in the body (the stage of the cancer). The most common treatment is surgery, which
removes either the tumor and surrounding tissue (lumpectomy) or the whole breast (mastectomy). In addition,
once the cancer has spread in the body, the common treatment is radiation therapy, which is intended to
kill cancer cells using high-energy waves. Another way to kill cancer cells is chemotherapy, which is the use
of drugs; however, this treatment has side effects such as hair loss, early menopause, and fatigue.
The use of medicine to block hormones, especially estrogen, also works as a treatment. Currently, however,
there is no complete cure for cancer. Therefore, the sooner it is known whether someone has cancer,
the earlier it can be treated.

Many machine learning methods have been applied to breast cancer classification, such as Support
Vector Machines [1] and network-based methods [2]. This research compares SVM and random forest in terms of accuracy. SVM
is a widely known classification method that has been used for acute sinusitis [3], face identification [4],
predicting bank failure [5], intrusion detection systems [6, 7, 8], classification of schizophrenia [9],
detection of traffic incidents [10], and face recognition [11, 12]. Previous studies have utilized random forest
for gene selection and classification [13], classification of Android malware [14], predicting bank
failure [15], predicting prostate cancer [16], and osteoarthritis classification [17]. This research is expected to help
the health sector classify breast cancer sufferers.


2. RESEARCH METHOD 
2.1. Data 
The data in this study were taken from the UCI machine learning repository [18]. The data consist of nine
features, as follows:
- Age (years)
- BMI (kg/m²)
- Glucose (mg/dL)
- Insulin (µU/mL)
- HOMA
- Leptin (ng/mL)
- Adiponectin (µg/mL)
- Resistin (ng/mL)
- MCP.1 (pg/mL)
There are two classes, with 116 observations consisting of 52 healthy people (1) and 64
patients (2).


Table 1. Data of breast cancer from UCI machine learning repository

Age   BMI     Glucose   Insulin   HOMA   Leptin   Adiponectin   Resistin   MCP.1    Class
76    ZT      110       26.2      ALI    21.78    4.93          8.49       45.84    1
77    25.9    85        4.58      0.96   13.74    9.75          11.77      488.82   1
45    21.3    102       13.852    3.48   7.64     21.05         23.03      552.44   2
45    20.83   74        4.56      0.83   (ee)     8.23          28.03      382.95   2


2.2. Supervised learning 
Supervised learning is a method that learns from labeled data; when it provides discrete predictions, the task is called classification. The data is split
into training data, which is used to fit the model and obtain the best parameters, and test data, to which the obtained
model is applied. Supervised learning keeps updating the model to make it the best possible, and with the resulting model,
new data can be inputted and classified.


2.3. Decision tree 

A decision tree is a model diagram that consists of nodes and edges. There are several types of nodes:
the root node, parent (internal) nodes, and child nodes. The root node is the starting node, and its
branches lead to further nodes. A node that branches again is a parent node, and each of its branches leads to a child node, which is either a right
or a left node. A child node that does not have any branch is called a terminal (leaf) node. Figure 1
shows a simple decision tree consisting of a root node, internal nodes, and leaf nodes.





Figure 1. Diagram of a decision tree consisting of a root node, internal nodes, and leaf nodes


The decision tree is a machine learning model which is easy to understand because
it can be visualized. It follows the rules of Boolean algebra. The tree is built by a binary recursive partitioning of the whole
data set, so that the data in each partition become homogeneous [19]. Making a decision tree model involves
these processes [19]:

- Split the parent node into child nodes using a goodness-of-split criterion
- State the terminal node, which is the last node of the decision tree
- Assign a class to each terminal node

The algorithm for making the decision tree model uses divide and conquer recursively [20].
It follows the steps below:

- Choose a feature as the root node and make a branch for each of its possible values.
- Divide the training set into subsets, one for each branch.
- Recursively repeat the process for every branch using its data.
- Stop when all the data in a branch have the same class.
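
The divide-and-conquer steps can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the Gini impurity used to choose the feature, the toy rows, and the names `build_tree` and `predict` are assumptions made for illustration, and the features are treated as categorical to keep the code short.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_by(rows, feature):
    """Divide the rows into one subset per observed value of `feature`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[feature], []).append(row)
    return subsets

def build_tree(rows, features, target="class"):
    labels = [row[target] for row in rows]
    # Stop when all the data have the same class (or no feature is left).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def weighted_impurity(feature):
        return sum(
            len(subset) / len(rows) * gini([r[target] for r in subset])
            for subset in split_by(rows, feature).values()
        )

    # Choose the feature whose split gives the lowest weighted impurity.
    best = min(features, key=weighted_impurity)
    remaining = [f for f in features if f != best]
    # One branch per value; recurse on every branch using its own data.
    return {
        "feature": best,
        "branches": {
            value: build_tree(subset, remaining, target)
            for value, subset in split_by(rows, best).items()
        },
    }

def predict(tree, row):
    """Follow the branches until a leaf is reached.
    Raises KeyError for a feature value never seen during training."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["feature"]]]
    return tree

# Hypothetical toy rows, not taken from the data set of this study.
toy = [
    {"glucose": "high", "age": "old", "class": 2},
    {"glucose": "high", "age": "young", "class": 2},
    {"glucose": "normal", "age": "old", "class": 2},
    {"glucose": "normal", "age": "young", "class": 1},
]
tree = build_tree(toy, ["glucose", "age"])
print(predict(tree, {"glucose": "normal", "age": "young"}))
```

Continuous features such as those in Table 1 would additionally need a threshold search at every split, which is omitted here.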


2.4. Boruta feature selection 

Boruta feature selection is built around the random forest classification algorithm, which can be run
without tuning of parameters and gives a numerical estimate of feature importance. Random forest
is a classification method which is performed by voting of multiple unbiased decision trees built from samples
of the training set [21]. The importance of a feature is obtained from the loss of classification accuracy.
In addition, the decision of a tree is not influenced by the others in the forest. The Boruta feature
selection algorithm is as follows [22]:

- Build an extended information system by adding copies of all features, randomly permuted across objects.
- Shuffle the copied features to minimize their correlations with the response.
- Run random forest on the extended system to obtain the computed Z scores.
- Determine the maximum Z score among the copied features and assign a hit to every feature that has a better
  score than this maximum. Furthermore, run a two-sided test of equality against the maximum Z score of the copied
  features for each feature with undetermined importance.
- Label the features with a Z score lower than that of the copied features as 'unimportant' and remove them from the system.
- Label the features with a Z score higher than that of the copied features as 'important'.
- Remove all copied features and repeat the procedure until none is removed.
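
A single round of this procedure can be sketched in Python as follows. This is a simplified illustration and not the Boruta implementation: the importance score is the absolute Pearson correlation with the response, swapped in for the random forest Z scores to keep the example self-contained, and the statistical test and the repetition over rounds are left out. The names `make_shadows`, `abs_correlation`, and `boruta_round` are hypothetical.

```python
import random

def make_shadows(columns, rng):
    """Return a permuted copy (a 'shadow') of every feature column."""
    shadows = {}
    for name, values in columns.items():
        shuffled = list(values)
        rng.shuffle(shuffled)  # permuting across objects breaks the link to the response
        shadows["shadow_" + name] = shuffled
    return shadows

def abs_correlation(xs, ys):
    """Stand-in importance score: |Pearson correlation| with the response.
    ASSUMPTION: real Boruta uses random forest Z scores instead."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def boruta_round(columns, response, rng):
    """Score real and shadow features, then give a hit to every real
    feature that scores better than the best shadow feature."""
    shadows = make_shadows(columns, rng)
    scores = {name: abs_correlation(v, response) for name, v in columns.items()}
    max_shadow = max(abs_correlation(v, response) for v in shadows.values())
    hits = {name: score > max_shadow for name, score in scores.items()}
    return hits, max_shadow

# Hypothetical toy columns, not taken from the data set of this study.
columns = {
    "signal": [1, 2, 3, 4, 5, 6, 7, 8],
    "noise": [4, 7, 1, 6, 5, 2, 8, 3],
}
response = [1, 1, 1, 1, 2, 2, 2, 2]
hits, max_shadow = boruta_round(columns, response, random.Random(0))
print(hits, max_shadow)
```

Because the shadow features are random, the hits from one round vary with the seed; this is why the full algorithm accumulates hits over many rounds before labeling a feature.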


2.5. Random forest 

Random forest was first introduced by Ho in 1995. It is an ensemble of many decision
trees built using bootstrapping and random feature selection to split nodes. Random forest is suitable for this study because
it performs well on large datasets. Figure 2 shows a diagram of a random forest, which is built of many
decision trees.

Figure 2. Diagram of random forest which consists of many decision trees


Random forest is a classifier consisting of a collection of classification trees $\{T(w, v_i),\ i = 1, \ldots, I\}$, where the $\{v_i\}$
are independently and identically distributed random vectors and each tree casts a vote at the input $w$. Compared with
a single decision tree, random forest is more stable and accurate, although its accuracy depends on
the correlation between the trees. By combining many trees into a random forest, the risk of overfitting
is reduced, and the generalization error converges to a limiting value. Given an ensemble of classifiers
$T_1(X), T_2(X), \ldots, T_I(X)$ with training data drawn at random from the distribution of the random vector $(X, Y)$, the margin function
is written as:

$$ mg(X, Y) = \mathrm{av}_i\, \mathbb{I}(T_i(X) = Y) - \max_{j \neq Y} \mathrm{av}_i\, \mathbb{I}(T_i(X) = j) \tag{1} $$

where $\mathbb{I}(\cdot)$ is the indicator function and $\mathrm{av}_i$ is the average over the trees. $T_i(X) = Y$ means that tree $i$ predicts
the correct class $Y$, and $T_i(X) = j$ means that it predicts another class $j$. The margin measures how far
the average vote for the correct class exceeds the average vote for any other class. A greater margin value means a more accurate classification.

$$ E = P_{X,Y}(mg(X, Y) < 0) \tag{2} $$

$E$ denotes the generalization error, while the subscripts $X, Y$ indicate that the probability is taken over the $X, Y$ space [23].
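
To make (1) and (2) concrete, the following Python sketch computes the margin of an ensemble from the votes of its trees and the fraction of samples with a negative margin, which is the empirical counterpart of the generalization error. The function names and the example votes are hypothetical.

```python
from collections import Counter

def margin(votes, true_class):
    """Share of tree votes for the true class minus the largest share
    of votes received by any other class, as in (1)."""
    counts = Counter(votes)
    n = len(votes)
    correct = counts.get(true_class, 0) / n
    others = [c / n for label, c in counts.items() if label != true_class]
    return correct - max(others, default=0.0)

def error_rate(all_votes, true_classes):
    """Fraction of samples whose margin is negative, an empirical
    estimate of the probability in (2)."""
    margins = [margin(v, y) for v, y in zip(all_votes, true_classes)]
    return sum(m < 0 for m in margins) / len(margins)

# Five trees vote on a patient (class 2): three vote 2, two vote 1.
print(margin([2, 2, 2, 1, 1], true_class=2))
# Four trees vote on another patient: only one votes for the true class.
print(margin([1, 1, 1, 2], true_class=2))
```

A positive margin means the forest classifies the sample correctly by majority vote; a negative margin means it is misclassified.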


2.6. Support vector machines 

Support Vector Machines is one of the supervised learning methods, introduced by Vapnik in
the 1990s. In its early days, it was used only for classification; however, it has since been developed and is capable
of solving regression problems. SVM tries to solve the classification problem by forming a hyperplane which
divides the data into classes while maximizing the margin. The distance from the hyperplane to the nearest point
of each class is known as the margin. Figure 3 is an illustration of the SVM model.

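Given a dataset $P = \{(x_u, y_u)\}_{u=1}^{t}$, where $x_u$ is a feature vector and $y_u \in \{-1, 1\}$ is its class label, SVM tries to solve
the following problem:

$$ \min_{\theta, b} \frac{1}{2} \|\theta\|^2 \quad \text{subject to} \quad y_u(\theta^T x_u + b) \geq 1, \quad u = 1, 2, \ldots, t \tag{3} $$

For cases with some error, a parameter $C > 0$ and slack variables $\xi_u \geq 0$ are added to the problem:

$$ \min_{\theta, b, \xi} \frac{1}{2} \|\theta\|^2 + C \sum_{u=1}^{t} \xi_u \quad \text{subject to} \quad y_u(\theta^T x_u + b) \geq 1 - \xi_u, \quad u = 1, 2, \ldots, t \tag{4} $$

TELKOMNIKA Telecommun Comput El Control, Vol. 18, No. 2, April 2020: 815 - 821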


A kernel function is used for problems which cannot be solved using a linear hyperplane. It
is defined as:

$$ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \tag{5} $$

These are the most common kernel functions [24, 25]:
- Linear: $K(x_i, x_j) = x_i^T x_j$
- Radial Basis Function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$
- Polynomial: $K(x_i, x_j) = (x_i^T x_j + 1)^m$, $m > 0$
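
The three kernels can be written directly in Python, as in the sketch below. The function names are hypothetical, and using 0.328524 as the RBF parameter assumes that the value g reported with Figure 5 is this kernel's parameter.

```python
import math

def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma):
    """exp(-gamma * squared Euclidean distance), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, z, degree):
    """(inner product + 1) raised to the power `degree`, with degree > 0."""
    return (linear_kernel(x, z) + 1) ** degree

x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))
print(rbf_kernel(x, z, gamma=0.328524))  # assumed to be the g of Figure 5
print(polynomial_kernel(x, z, degree=2))
```

The RBF kernel equals 1 for identical points and decays toward 0 as the points move apart, with a larger parameter giving a faster decay.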





Figure 3. SVM solves the classification problem by forming a hyperplane which maximizes
the margin of the data


2.7. Model performance validation 

This study applied hold-out validation to validate the performance of the model. It works
by splitting the data into training and testing data; the model is built from the training data and tested with
the testing data. To overcome the weakness of hold-out validation, it was performed nine times,
each with a different percentage of the data used for training.
The performance of the model is measured by the accuracy, given by:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \tag{9} $$

where TP, TN, FP, and FN represent True Positive, True Negative, False Positive, and False Negative, respectively.
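
The validation scheme and the accuracy formula can be sketched in Python as below. Several details are assumptions rather than facts from this study: the nine training percentages are taken to be 10% to 90% in steps of 10%, class 2 (patients) is treated as the positive class, and the function names and the fixed seed are hypothetical.

```python
import random

def hold_out_split(rows, train_fraction, seed=0):
    """Shuffle the rows and cut them into a training and a testing part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(actual, predicted, positive=2):
    """(TP + TN) / (TP + TN + FN + FP), with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return (tp + tn) / (tp + tn + fn + fp)

# Stand-ins for the 116 observations; a real run would use the data rows.
observations = list(range(116))
for fraction in [i / 10 for i in range(1, 10)]:  # assumed 10% ... 90%
    train, test = hold_out_split(observations, fraction)
    print(f"{fraction:.0%} training: {len(train)} train, {len(test)} test")

# Four test labels with one patient misclassified as healthy.
print(accuracy(actual=[2, 2, 1, 1], predicted=[2, 1, 1, 1]))
```

With a small training percentage the model sees very few observations, while with a large one the accuracy is estimated from very few test observations, so the result of any single split should be read with care.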


3. RESULTS AND ANALYSIS 
This study used RStudio version 1.1.463 for both random forest and support vector machines.


3.1. Result of boruta feature selection 

The result of Boruta feature selection determines whether each feature is important or not, as shown
in Table 2. According to the result shown in the table, the important features are Age, BMI, Glucose, HOMA,
and Resistin. Only the features labeled important were used to build the machine learning models,
namely random forest and support vector machines.

Table 2. Result of Boruta feature selection

Feature       Result
Age           Important
BMI           Important
Glucose       Important
Insulin       Unimportant
HOMA          Important
Leptin        Unimportant
Adiponectin   Unimportant
Resistin      Important
MCP.1         Unimportant


3.2. Breast cancer classification using random forest 
According to the result shown in Figure 4, the model performed best with 80% of the data used for training,
resulting in an accuracy of 90.90%. Conversely, the worst accuracy, 74.75%, was recorded with 10%
training data.

Figure 4. Accuracy of breast cancer classification performed by random forest


3.3. Breast cancer classification using support vector machines 

According to the result shown in Figure 5, the model performed best with 80% of the data used for
training, which resulted in an accuracy of 95.45%. Conversely, the worst accuracy, 72.81%, was recorded with 10%
training data.

Figure 5. Accuracy of breast cancer classification performed by support vector machines with RBF
kernel, parameter C = 1 and g = 0.328524


4. CONCLUSION 

This study used Random Forest (RF) and Support Vector Machines (SVM) as the machine learning
methods to classify breast cancer. Hold-out validation was used to validate and evaluate
the performance of the models. In the simulation, SVM used the Radial Basis Function (RBF) kernel, with
C = 1 and g = 0.328524 as the best parameters of the model. According to the experimental results, RF reached
its best accuracy of 90.90% using 80% training data, while SVM reached a better accuracy of 95.45%, also using 80% training data.
These results show that the performance of SVM is better than that of RF in terms of accuracy.